Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian
Authors
Abstract
This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. It is based on two newspaper subcorpora drawn from larger reference corpora of Bulgarian and Croatian. Initially, we rely on parameters of similarity that are more extralinguistically oriented but methodologically cleaner, such as specific topics, a predefined time span, and data size. The idea of ‘light’ and ‘hard’ comparable corpora is introduced. At this stage we aim to produce a ‘light’ bilingual comparable corpus. The algorithm for identifying lexical similarity and aligning linguistic units is presented, and the initial experiments are outlined.
Related works
The verbal prefix o(b)- in Croatian and Bulgarian: The semantic network and challenges of a corpus-based study
This study compares the verbal prefix o(b)- in two South Slavic languages, Croatian and Bulgarian, from a cognitive linguistic perspective. We focus on the problems arising when constructing the semantic network of this polysemous prefix, particularly on 1) isolating the prefix’s meaning from the meaning of the base verb and 2) identifying core/dominant sub-meanings for all verbs and giving the...
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...
Extracting Lay Paraphrases of Specialized Expressions from Monolingual Comparable Medical Corpora
Whereas multilingual comparable corpora have been used to identify translations of words or terms, monolingual corpora can help identify paraphrases. The present work addresses paraphrases found between two different discourse types: specialized and lay texts. We therefore built comparable corpora of specialized and lay texts in order to detect equivalent lay and specialized expressions. We ide...
Utilizing Citations of Foreign Words in Corpus-Based Dictionary Generation
Previous work concerned with the identification of word translations from text collections has been either based on parallel or on comparable corpora of the respective languages. In the case of comparable corpora basic dictionaries have been necessary to form a bridge between the languages under consideration. We present here a novel approach to identify word translations from a single monoling...
Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair
This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool fo...